Abstract. Many data management applications, such as setting up Web portals,
managing enterprise data, managing community data, and sharing scientific data, 
require integrating data from multiple sources. Each of these sources provides a 
set of values and different sources can often provide conflicting values. To present 
quality data to users, it is critical to resolve conflicts and discover values that 
reflect the real world; this task is called data fusion. This paper describes a novel 
approach that finds true values from conflicting information when there are a large 
number of sources, among which some may copy from others. We present a case 
study on real-world data showing that the described algorithm can significantly 
improve accuracy of truth discovery and is scalable when there are a large number 
of data sources. 


1 Introduction 

The amount of useful information available on the Web has been growing at a 
dramatic pace in recent years. In a variety of domains, such as science, business, 
technology, arts, entertainment, politics, government, sports, tourism, there are a 
huge number of data sources that seek to provide information to a wide spectrum 
of information users. In addition to enabling the availability of useful information, 
the Web has also eased the ability to publish and spread false information across 
multiple sources. Widespread availability of conflicting information (some true, 
some false) makes it hard to separate the wheat from the chaff. Simply using 
the information that is asserted by the largest number of data sources (i.e., naive
voting) is clearly inadequate since biased (and even malicious) sources abound,
and plagiarism (i.e., copying without proper attribution) between sources may be
widespread. Data fusion aims at resolving conflicts from different sources and
finding values that reflect the real world.

Ideally, when applying voting, we would like to give a higher vote to more
trustworthy sources and ignore copied information; however, this raises many
challenges. First, we often do not know a priori the trustworthiness of a source,
which depends on how much of its provided data is correct; the correctness of
data, in turn, needs to be decided by considering the number and trustworthiness
of the providers. Thus, it is a chicken-and-egg problem. Second, in many
applications we do not know how each source obtains its data, so we have to
discover copiers from a snapshot of data. The discovery is non-trivial: sharing
common data does not in itself imply copying, since accurate sources can also
share a lot of independently provided correct data; not sharing a lot of common
data does not in itself imply no copying, since a copier may copy only a small
fraction of data from the original source; and even when we decide that two
sources are dependent, it is not always obvious which one is the copier. Third,
a copier can also provide some data by itself or verify the correctness of some
of the copied data, so it is inappropriate to ignore all data it provides.

Table 1. The motivating example: five data sources provide information on the
affiliations of five researchers. Only $S_1$ provides all true values.

In this paper, we present novel approaches for data fusion. First, we consider 
copying between data sources in truth discovery. Our technique considers not only 
whether two sources share the same values, but also whether the shared values are 
true or false. Intuitively, for a particular object, there are often multiple distinct 
false values but usually only one true value. Sharing the same true value does 
not necessarily imply copying between sources; however, sharing the same false 
value is typically a low-probability event when the sources are fully independent. 
Thus, if two data sources share a lot of false values, copying is more likely. Based 
on this analysis, we describe Bayesian models that compute the probability of 
copying between pairs of data sources and take the result into consideration in 
truth discovery. 

Second, we also consider accuracy in voting: we trust an accurate data source 
more and give values that it provides a higher weight. This method requires
identifying not only if two sources are dependent, but also which source is the copier.
Indeed, accuracy in itself is a clue of direction of copying: given two data sources, 
if the accuracy of their common data is highly different from that of one of the 
sources, that source is more likely to be a copier. 

Example 1. Consider the five data sources in Table 1. They provide information
on the affiliations of five researchers, and only $S_1$ provides all correct data.
Sources $S_4$ and $S_5$ copy their data from $S_3$, and $S_5$ introduces certain
errors during copying.
First consider the three sources $S_1$, $S_2$, and $S_3$. For all researchers
except Carey, naive voting on data provided by these three sources finds the
correct affiliations. For Carey, these sources provide three different
affiliations, resulting in a tie. However, if we take into account that the data
provided by $S_1$ is more accurate (for the other four researchers, $S_1$ provides
all correct affiliations, whereas $S_2$ provides 3 and $S_3$ provides only 2
correct affiliations), we will consider UCI as most likely to be the correct value.

Now consider in addition sources $S_4$ and $S_5$. Since the affiliations provided
by $S_3$ are copied by $S_4$ and $S_5$, naive voting would consider them as the
majority and so make wrong decisions for three researchers. Only if we ignore the
values provided by $S_4$ and $S_5$ will we again be able to decide the correct
affiliations. Note, however, that identifying the copying relationships is not
easy: while $S_3$ shares 5 values with $S_4$ and 4 values with $S_5$, $S_1$ and
$S_2$ also share 3 values, more than half of all values. If we knew which values
are true and which are false, we would suspect copying between $S_3$, $S_4$ and
$S_5$, because they provide the same false values. On the other hand, we would
suspect copying between $S_1$ and $S_2$ much less, as they share only true values.

The structure of the rest of the paper is as follows. Section 2 presents how we can
leverage source accuracy in data fusion. Section 3 presents how we can leverage
copying relationships in data fusion. Section 4 presents a case study of these
techniques on a real-world data set, and Section 5 concludes.


2 Fusing Sources Considering Accuracy 

We first formally describe the data fusion problem and describe how we leverage 
the trustworthiness of sources in truth discovery. In this section we assume no
copying between data sources and defer discussion on copying to the next section.

2.1 Data Fusion 

We consider a set of data sources $\mathcal{S}$ and a set of objects $\mathcal{O}$.
An object represents a particular aspect of a real-world entity, such as the
affiliation of a researcher; in a relational database, an object corresponds to a
cell in a table. For each object $O \in \mathcal{O}$, a source can (but need not)
provide a value. Among the different values provided for an object, one correctly
describes the real world and is true, and the rest are false. In this paper we
solve the following problem: given a snapshot of data sources in $\mathcal{S}$,
decide the true value for each object $O \in \mathcal{O}$.

We note that a value provided by a data source can either be atomic, or a set or 
list of atomic values (e.g., author list of a book). In the latter case, we consider the
value as true if the atomic values are correct and the set or list is complete (and 
order preserved for a list). This setting already fits many real-world applications 
and we refer our readers to ODD for solutions that treat a set or list of values as 
multiple values. 

We consider a core case that satisfies the following two conditions (relaxation of 
these assumptions is discussed in 0 ): 

- Uniform false-value distribution: For each object, there are multiple false 
values in the underlying domain and an independent source has the same 
probability of providing each of them. 

- Categorical value: For each object, values that do not match exactly are 
considered as completely different. 

Note that this problem definition focuses on static information that does not 
evolve over time, such as authors and publishers of books, and we refer our
readers to (§) for data fusion for evolving values.

2.2 Accuracy of a Source 

Let $S \in \mathcal{S}$ be a data source. The accuracy of $S$, denoted by $A(S)$,
is the fraction of true values provided by $S$; it can also be considered as the
probability that a value provided by $S$ is the true value.

Ideally we should compute the accuracy of a source as it is defined; however, in
real applications we often do not know for sure which values are true, especially
among values that are provided by a similar number of sources. Thus, we compute
the accuracy of a source as the average probability of its values being true (we
describe how we compute such probabilities shortly). Formally, let $\bar{V}(S)$ be
the values provided by $S$ and denote by $|\bar{V}(S)|$ the size of $\bar{V}(S)$.
For each $v \in \bar{V}(S)$, we denote by $P(v)$ the probability that $v$ is true.
We compute $A(S)$ as follows.

$$A(S) = \frac{\sum_{v \in \bar{V}(S)} P(v)}{|\bar{V}(S)|}. \qquad (1)$$

We distinguish good sources from bad ones: a data source is considered to be
good if for each object it is more likely to provide the true value than any
particular false value; otherwise, it is considered to be bad. Assume for each
object in $\mathcal{O}$ the number of false values in the domain is $n$. Then, in
the core case, the probability that $S$ provides a true value is $A(S)$ and that it
provides a particular false value is $\frac{1 - A(S)}{n}$. So $S$ is good if
$A(S) > \frac{1 - A(S)}{n}$ (i.e., $A(S) > \frac{1}{1+n}$). We focus
on good sources in the rest of this paper, unless otherwise specified.
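
The accuracy estimate and the good-source condition can be illustrated with a minimal Python sketch. The function names and the plain list of value probabilities are illustrative assumptions and do not come from the paper.

```python
from typing import Sequence


def source_accuracy(value_probs: Sequence[float]) -> float:
    """Accuracy of a source as the average probability that its values are true."""
    if not value_probs:
        raise ValueError("the source provides no values")
    return sum(value_probs) / len(value_probs)


def is_good(accuracy: float, n_false: int) -> bool:
    """A source is good if A(S) > (1 - A(S)) / n, i.e. A(S) > 1 / (1 + n)."""
    return accuracy > 1.0 / (1 + n_false)
```

With $n = 5$ false values per object, for instance, any source whose accuracy exceeds $1/6$ counts as good under this definition.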


2.3 Probability of a Value Being True 

Now we need a way to compute the probability that a value is true. Intuitively, 
the computation should consider both how many sources provide the value and 
accuracy of those sources. We apply a Bayesian analysis for this purpose. 
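Consider an object $O \in \mathcal{O}$. Let $\mathcal{V}(O)$ be the domain of $O$,
including one true value and $n$ false values. Let $\bar{S}_o$ be the sources that
provide information on $O$. For each $v \in \mathcal{V}(O)$, we denote by
$\bar{S}_o(v) \subseteq \bar{S}_o$ the set of sources that vote for $v$
($\bar{S}_o(v)$ can be empty). We denote by $\Psi(O)$ the observation of which
value each $S \in \bar{S}_o$ votes for on $O$.

To compute $P(v)$ for $v \in \mathcal{V}(O)$, we need to first compute the
probability of $\Psi(O)$ conditioned on $v$ being true. This probability should be
that of sources in $\bar{S}_o(v)$ each providing the true value and other sources
each providing a particular false value:

$$\Pr(\Psi(O) \mid v \text{ true}) = \prod_{S \in \bar{S}_o(v)} A(S) \cdot \prod_{S \in \bar{S}_o \setminus \bar{S}_o(v)} \frac{1 - A(S)}{n}. \qquad (2)$$

Among the values in $\mathcal{V}(O)$, there is one and only one true value. Assume
our a priori belief of each value being true is the same, denoted by $\beta$. We
then have

$$\Pr(\Psi(O)) = \sum_{v \in \mathcal{V}(O)} \left( \beta \cdot \prod_{S \in \bar{S}_o(v)} A(S) \cdot \prod_{S \in \bar{S}_o \setminus \bar{S}_o(v)} \frac{1 - A(S)}{n} \right). \qquad (3)$$

Applying Bayes' rule then gives

$$P(v) = \Pr(v \text{ true} \mid \Psi(O)) = \frac{\prod_{S \in \bar{S}_o(v)} \frac{n A(S)}{1 - A(S)}}{\sum_{v_0 \in \mathcal{V}(O)} \prod_{S \in \bar{S}_o(v_0)} \frac{n A(S)}{1 - A(S)}}. \qquad (4)$$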

To simplify the computation, we define the confidence of $v$, denoted by $C(v)$,
as $C(v) = \sum_{S \in \bar{S}_o(v)} \log \frac{n A(S)}{1 - A(S)}$. If we define the
accuracy score of a data source $S$ as $A'(S) = \log \frac{n A(S)}{1 - A(S)}$, we
have $C(v) = \sum_{S \in \bar{S}_o(v)} A'(S)$. So we can compute the confidence of a
value by summing up the accuracy scores of its providers. Finally, we can compute
the probability of each value as
$P(v) = \frac{2^{C(v)}}{\sum_{v_0 \in \mathcal{V}(O)} 2^{C(v_0)}}$. A value
with a higher confidence has a higher probability to be true; thus, rather than
comparing vote counts, we can just compare confidence of values. The following
theorem shows three nice properties of Equation (4).

Theorem 1. Equation (4) has the following properties:

1. If all data sources are good and have the same accuracy, when the size of
$\bar{S}_o(v)$ increases, $C(v)$ increases;

2. Fixing all sources in $\bar{S}_o(v)$ except $S$, when $A(S)$ increases for $S$,
$C(v)$ increases.

3. If there exists $S \in \bar{S}_o(v)$ such that $A(S) = 1$ and no
$S' \in \bar{S}_o(v)$ such that $A(S') = 0$, $C(v) = +\infty$; if there exists
$S \in \bar{S}_o(v)$ such that $A(S) = 0$ and no $S' \in \bar{S}_o(v)$ such that
$A(S') = 1$, $C(v) = -\infty$.

Note that the first property is actually a justification for the naive voting strategy 
when all sources have the same accuracy. The third property shows that we should 
be careful not to assign very high or very low accuracy to a data source, which 
has been avoided by defining the accuracy of a source as the average probability 
of its provided values. 
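
As a rough illustration of how accuracy scores, confidence, and value probabilities fit together, the following Python sketch sums the accuracy scores of the sources voting for each value and normalizes over the whole domain. The function names, the `votes` and `accuracies` dictionaries, and the default logarithm base of 2 are assumptions made for this sketch; domain values that no source votes for are given confidence 0.

```python
import math
from typing import Dict, Iterable


def accuracy_score(accuracy: float, n_false: int, base: float = 2.0) -> float:
    """A'(S) = log(n * A(S) / (1 - A(S))); requires 0 < A(S) < 1."""
    if not 0.0 < accuracy < 1.0:
        raise ValueError("accuracy must lie strictly between 0 and 1")
    return math.log(n_false * accuracy / (1.0 - accuracy), base)


def value_probabilities(
    votes: Dict[str, Iterable[str]],
    accuracies: Dict[str, float],
    n_false: int,
    base: float = 2.0,
) -> Dict[str, float]:
    """Map each observed value to P(v), given value -> voting sources."""
    # Confidence C(v): sum of the accuracy scores of the sources voting for v.
    conf = {
        v: sum(accuracy_score(accuracies[s], n_false, base) for s in sources)
        for v, sources in votes.items()
    }
    # Shift by the largest confidence before exponentiating, for stability.
    top = max(conf.values())
    weights = {v: base ** (c - top) for v, c in conf.items()}
    # Domain values without any vote have confidence 0 (an empty sum).
    unseen = max(0, n_false + 1 - len(conf)) * base ** (-top)
    total = sum(weights.values()) + unseen
    return {v: w / total for v, w in weights.items()}
```

Because the same base is used for the logarithm and for the exponentiation, the resulting probabilities do not depend on the base chosen.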

Example 2. Consider $S_1$, $S_2$ and $S_3$ in Table 1 and assume their accuracies
are .97, .6, and .4, respectively. Assuming there are 5 false values in the domain
(i.e., $n = 5$), we can compute the accuracy score of each source as follows. For
$S_1$, $A'(S_1) = \log \frac{5 \cdot .97}{1 - .97} = 4.7$; for $S_2$,
$A'(S_2) = \log \frac{5 \cdot .6}{1 - .6} = 2$; and for $S_3$,
$A'(S_3) = \log \frac{5 \cdot .4}{1 - .4} = 1.5$.

Now consider the three values provided for Carey. Value UCI thus has confidence
8, AT&T has confidence 5, and BEA has confidence 4. Among them, UCI has the
highest confidence and so the highest probability to be true.

Computing value confidence requires knowing accuracy of data sources, whereas
computing source accuracy requires knowing value probability. There is an
interdependence between them and we solve the problem by computing them
iteratively. We give details of the iterative algorithm in Section 3.
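
The interdependence between value probabilities and source accuracies can be sketched as a simple fixed-point loop. The Python code below is a simplified illustration that ignores copying; the function name `fuse_accu`, the initial accuracy of 0.8, the clamping bounds, and the stopping tolerance are all assumptions of this sketch rather than settings taken from the paper.

```python
import math
from collections import defaultdict


def fuse_accu(claims, n_false=100, init_acc=0.8, tol=1e-4, max_rounds=100):
    """claims: {object: {source: value}}. Returns (truths, accuracies)."""
    sources = {s for by_src in claims.values() for s in by_src}
    acc = {s: init_acc for s in sources}
    probs = {}
    for _ in range(max_rounds):
        # Step 1: probability of each observed value from current accuracies.
        probs = {}
        for obj, by_src in claims.items():
            conf = defaultdict(float)
            for s, v in by_src.items():
                conf[v] += math.log(n_false * acc[s] / (1.0 - acc[s]))
            top = max(conf.values())
            weights = {v: math.exp(c - top) for v, c in conf.items()}
            # Domain values that nobody voted for have confidence 0.
            unseen = max(0, n_false + 1 - len(conf)) * math.exp(-top)
            total = sum(weights.values()) + unseen
            probs[obj] = {v: w / total for v, w in weights.items()}
        # Step 2: accuracy of a source as the mean probability of its values.
        sums, counts = defaultdict(float), defaultdict(int)
        for obj, by_src in claims.items():
            for s, v in by_src.items():
                sums[s] += probs[obj][v]
                counts[s] += 1
        # Clamp to keep accuracies away from 0 and 1 (infinite scores).
        new_acc = {s: min(max(sums[s] / counts[s], 0.01), 0.99) for s in sources}
        delta = max(abs(new_acc[s] - acc[s]) for s in sources)
        acc = new_acc
        if delta < tol:
            break
    truths = {obj: max(p, key=p.get) for obj, p in probs.items()}
    return truths, acc
```

On a small hypothetical input in which one source is always right and the others err on different objects, the loop breaks a three-way tie in favor of the more accurate source after the first accuracy update.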


3 Fusing Sources Considering Copying 

Next, we describe how we detect copiers and leverage the discovered copying 
relationships in data fusion. 

3.1 Copy Detection 

We say that there exists copying between two data sources $S_1$ and $S_2$ if they
derive the same part of their data directly or transitively from a common source
(which can be one of $S_1$ and $S_2$). Accordingly, there are two types of data
sources: independent sources and copiers. An independent source provides all
values independently. It may provide some erroneous values because of incorrect
knowledge of the real world, misspellings, etc. A copier copies a part (or all) of
its data from other sources (independent sources or copiers). It can copy from
multiple sources by union, intersection, etc., and as we focus on a snapshot of
data, cyclic copying on a particular object is impossible. In addition, a copier
may revise some of the copied values or add additional values; such revised and
added values are considered as independent contributions of the copier.

To make our models tractable, we consider only direct copying. In addition, we 
make the following assumptions. 

- Assumption 1 (Independent values). The values that are independently provided
by a data source on different objects are independent of each other.

- Assumption 2 (Independent copying). The copying between a pair of data
sources is independent of the copying between any other pair of data sources.

- Assumption 3 (No mutual copying). There is no mutual copying between a
pair of sources; that is, $S_1$ copying from $S_2$ and $S_2$ copying from $S_1$ do
not happen at the same time.

Our experiments on real-world data show that the basic model already obtains
high accuracy and we refer our readers to Jb) for how we can relax the
assumptions. We next describe the basic copy-detection model.

Consider two sources $S_1, S_2 \in \mathcal{S}$. We apply Bayesian analysis to
compute the probability of copying between $S_1$ and $S_2$ given observation of
their data. For this purpose, we need to compute the probability of the observed
data, conditioned on independence of or copying between the sources. We denote by
$c$ ($0 < c < 1$) the probability that a value provided by a copier is copied. We
bootstrap our algorithm by setting $c$ to a default value initially and iteratively
refine it according to copy detection results.

In our observation, we are interested in three sets of objects: $\bar{O}_t$,
denoting the set of objects on which $S_1$ and $S_2$ provide the same true value;
$\bar{O}_f$, denoting the set of objects on which they provide the same false
value; and $\bar{O}_d$, denoting the set of objects on which they provide different
values ($\bar{O}_t \cup \bar{O}_f \cup \bar{O}_d \subseteq \mathcal{O}$).
Intuitively, two independent sources providing the same false value is a
low-probability event; thus, if we fix $\bar{O}_t \cup \bar{O}_f$ and $\bar{O}_d$,
the more common false values that $S_1$ and $S_2$ provide, the more likely that
they are dependent. On the other hand, if we fix $\bar{O}_t$ and $\bar{O}_f$, the
fewer objects on which $S_1$ and $S_2$ provide different values, the more likely
that they are dependent. We denote by $\Phi$ the observation of $\bar{O}_t$,
$\bar{O}_f$, $\bar{O}_d$ and by $k_t$, $k_f$ and $k_d$ their sizes, respectively.
We next describe how we compute the conditional probability of $\Phi$ based on
these intuitions.

We first consider the case where $S_1$ and $S_2$ are independent, denoted by
$S_1 \perp S_2$. Since there is a single true value, the probability that $S_1$ and
$S_2$ provide the same true value for object $O$ is

$$\Pr(O \in \bar{O}_t \mid S_1 \perp S_2) = A(S_1) \cdot A(S_2). \qquad (5)$$

On the other hand, the probability that $S_1$ and $S_2$ provide the same false
value for $O$ is

$$\Pr(O \in \bar{O}_f \mid S_1 \perp S_2) = n \cdot \frac{1 - A(S_1)}{n} \cdot \frac{1 - A(S_2)}{n} = \frac{(1 - A(S_1))(1 - A(S_2))}{n}. \qquad (6)$$

Then, the probability that $S_1$ and $S_2$ provide different values on an object
$O$, denoted by $P_d$ for convenience, is

$$\Pr(O \in \bar{O}_d \mid S_1 \perp S_2) = 1 - A(S_1) A(S_2) - \frac{(1 - A(S_1))(1 - A(S_2))}{n} = P_d. \qquad (7)$$

Following the Independent-values assumption, the conditional probability of
observing $\Phi$ is

$$\Pr(\Phi \mid S_1 \perp S_2) = \frac{A(S_1)^{k_t} A(S_2)^{k_t} (1 - A(S_1))^{k_f} (1 - A(S_2))^{k_f} P_d^{k_d}}{n^{k_f}}. \qquad (8)$$
We next consider the case when $S_2$ copies from $S_1$, denoted by $S_2 \to S_1$.
There are two cases where $S_1$ and $S_2$ provide the same value $v$ for an object
$O$. First, with probability $c$, $S_2$ copies $v$ from $S_1$ and so $v$ is true
with probability $A(S_1)$ and false with probability $1 - A(S_1)$. Second, with
probability $1 - c$, the two sources provide $v$ independently and so its
probability of being true or false is the same as in the case where $S_1$ and $S_2$
are independent. Thus, we have

$$\Pr(O \in \bar{O}_t \mid S_2 \to S_1) = A(S_1) \cdot c + A(S_1) \cdot A(S_2) \cdot (1 - c), \qquad (9)$$

$$\Pr(O \in \bar{O}_f \mid S_2 \to S_1) = (1 - A(S_1)) \cdot c + \frac{(1 - A(S_1))(1 - A(S_2))}{n} \cdot (1 - c). \qquad (10)$$

Finally, the probability that $S_1$ and $S_2$ provide different values on an object
is that of $S_2$ providing a value independently and that value differing from the
one provided by $S_1$:

$$\Pr(O \in \bar{O}_d \mid S_2 \to S_1) = P_d \cdot (1 - c). \qquad (11)$$

We compute $\Pr(\Phi \mid S_2 \to S_1)$ accordingly; similarly we can also compute
$\Pr(\Phi \mid S_1 \to S_2)$. Now we can compute the probability of
$S_1 \perp S_2$ by applying the Bayes Rule.


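$$\Pr(S_1 \perp S_2 \mid \Phi) = \frac{\alpha \Pr(\Phi \mid S_1 \perp S_2)}{\alpha \Pr(\Phi \mid S_1 \perp S_2) + \frac{1 - \alpha}{2} \Pr(\Phi \mid S_1 \to S_2) + \frac{1 - \alpha}{2} \Pr(\Phi \mid S_2 \to S_1)}. \qquad (12)$$

Here $\alpha = \Pr(S_1 \perp S_2)$ ($0 < \alpha < 1$) is the a priori probability
that two data sources are independent. As we have no a priori preference for copy
direction, we set the a priori probability for copying in each direction as
$\frac{1 - \alpha}{2}$.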

Equation (12) has several nice properties that conform to the intuitions we
discussed earlier in this section, formalized as follows.

Theorem 2. Let $\mathcal{S}$ be a set of good independent sources and copiers.
Equation (12) has the following three properties on $\mathcal{S}$.

1. Fixing $k_t + k_f$ and $k_d$, when $k_f$ increases, the probability of copying
(i.e., $\Pr(S_1 \to S_2 \mid \Phi) + \Pr(S_2 \to S_1 \mid \Phi)$) increases;

2. Fixing $k_t + k_f + k_d$, when $k_t + k_f$ increases and none of $k_t$ and $k_f$
decreases, the probability of copying increases;

3. Fixing $k_t$ and $k_f$, when $k_d$ decreases, the probability of copying increases.






Example 3. Continuing the running example, consider the possible copying
relationship between $S_1$ and $S_2$. We observe that they share no false values
(all values they share are correct), so copying is unlikely. With $\alpha = .5$,
$c = .8$, $A(S_1) = .97$, $A(S_2) = .6$, the Bayesian analysis goes as follows.

We start with the computation of $\Pr(\Phi \mid S_1 \perp S_2)$. We have
$\Pr(O \in \bar{O}_t \mid S_1 \perp S_2) = .97 \cdot .6 = .582$. There is no object
in $\bar{O}_f$ and we denote by $P_d$ the probability
$\Pr(O \in \bar{O}_d \mid S_1 \perp S_2)$. Thus,
$\Pr(\Phi \mid S_1 \perp S_2) = .582^3 \cdot P_d^2 = .2 P_d^2$.

Next consider $\Pr(\Phi \mid S_1 \to S_2)$. We have
$\Pr(O \in \bar{O}_t \mid S_1 \to S_2) = .8 \cdot .6 + .2 \cdot .582 = .6$ and
$\Pr(O \in \bar{O}_d \mid S_1 \to S_2) = .2 P_d$. Thus,
$\Pr(\Phi \mid S_1 \to S_2) = .6^3 \cdot (.2 P_d)^2 = .008 P_d^2$. Similarly,
$\Pr(\Phi \mid S_2 \to S_1) = .028 P_d^2$.

According to Equation (12),
$\Pr(S_1 \perp S_2 \mid \Phi) = \frac{.5 \cdot .2 P_d^2}{.5 \cdot .2 P_d^2 + .25 \cdot .008 P_d^2 + .25 \cdot .028 P_d^2} = .92$,
so independence is very likely.
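
The Bayesian computation in this example can be checked numerically with a short Python sketch. The function name `prob_independent` and its argument layout are assumptions of this sketch; the per-object probabilities follow the independent and copying cases described above, with the roles of the two sources swapped to obtain the second copying direction.

```python
def prob_independent(a1, a2, k_t, k_f, k_d, n_false, c=0.8, alpha=0.5):
    """Posterior probability that two sources are independent.

    a1, a2: accuracies of S1 and S2; k_t, k_f, k_d: number of objects with
    the same true value, the same false value, and different values.
    """
    same_true = a1 * a2
    same_false = (1 - a1) * (1 - a2) / n_false
    p_d = 1 - same_true - same_false

    def likelihood(p_t, p_f, p_diff):
        return p_t ** k_t * p_f ** k_f * p_diff ** k_d

    indep = likelihood(same_true, same_false, p_d)

    def copy_likelihood(a_orig):
        # With probability c the copier takes its value from the original
        # source, whose accuracy is a_orig; otherwise it acts independently.
        return likelihood(
            a_orig * c + same_true * (1 - c),
            (1 - a_orig) * c + same_false * (1 - c),
            p_d * (1 - c),
        )

    s1_copies_s2 = copy_likelihood(a2)
    s2_copies_s1 = copy_likelihood(a1)
    numerator = alpha * indep
    return numerator / (numerator + (1 - alpha) / 2 * (s1_copies_s2 + s2_copies_s1))
```

Calling `prob_independent(0.97, 0.6, k_t=3, k_f=0, k_d=2, n_false=5)` gives roughly 0.91, close to the .92 obtained in the example from rounded intermediate values; since no false value is shared, the choice of `n_false` cancels out of the ratio.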

3.2 Independent Vote Count of a Value 

Since even a copier can provide some of the values independently, we compute
the independent vote for each particular value. In this process we consider the
data sources one by one in some order. For each source $S$, we denote by $Pre(S)$
the set of sources that have already been considered and by $Post(S)$ the set of
sources that have not been considered yet. We compute the probability that the
value provided by $S$ is independent of any source in $Pre(S)$ and take it as the
vote count of $S$. The vote count computed in this way is not precise because if
$S$ depends only on sources in $Post(S)$ but some of those sources depend on
sources in $Pre(S)$, our estimation still (incorrectly) counts $S$'s vote. To
minimize such error, we wish that the probability that $S$ depends on a source
$S' \in Post(S)$ and $S'$ depends on a source $S'' \in Pre(S)$ be the lowest. Thus,
we use a greedy algorithm and consider data sources in the following order.

1. If the probability of $S_1 \to S_2$ is much higher than that of $S_2 \to S_1$,
we consider $S_1$ as a copier of $S_2$ with probability
$\Pr(S_1 \to S_2 \mid \Phi) + \Pr(S_2 \to S_1 \mid \Phi)$ (recall that we assume
there is no mutual copying) and order $S_2$ before $S_1$. Otherwise, we consider
both directions as equally possible and there is no particular order between
$S_1$ and $S_2$; we consider such copying undirectional.

2. For each subset of sources between which there is no particular ordering
yet, we sort them as follows: in the first round, we select a data source
that is associated with the undirectional copying of the highest probability
($\Pr(S_1 \to S_2 \mid \Phi) + \Pr(S_2 \to S_1 \mid \Phi)$); in later rounds, each
time we select a data source that has the copying with the maximum probability
with one of the previously selected sources.

We now consider how to compute the vote count of $v$ once we have decided an
order of the data sources. Let $S$ be a data source that votes for $v$. The
probability that $S$ provides $v$ independently of a source $S_0 \in Pre(S)$ is
$1 - c(\Pr(S \to S_0 \mid \Phi) + \Pr(S_0 \to S \mid \Phi))$ and the probability
that $S$ provides $v$ independently of any data source in $Pre(S)$, denoted by
$I(S)$, is

$$I(S) = \prod_{S_0 \in Pre(S)} \left( 1 - c \left( \Pr(S \to S_0 \mid \Phi) + \Pr(S_0 \to S \mid \Phi) \right) \right). \qquad (13)$$

The total vote count of $v$ is $\sum_{S \in \bar{S}_o(v)} I(S)$.

Finally, when we consider the accuracy of sources, we compute the confidence
of $v$ as follows.

$$C(v) = \sum_{S \in \bar{S}_o(v)} A'(S) I(S). \qquad (14)$$

In the equation, $I(S)$ is computed by Equation (13). In other words, we take only
the “independent fraction” of the original vote count (decided by source accuracy)
from each source.
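
A minimal Python sketch of the independent vote count and the resulting confidence is given below. It assumes the sources voting for a value are already ordered, and that pairwise copying probabilities (the sum over both directions) are supplied in a dictionary keyed by unordered pairs; these names and data layouts are illustrative assumptions, and the greedy ordering step itself is not shown.

```python
def independence_weights(ordered_sources, copy_prob, c=0.8):
    """I(S) for each source, visiting the sources in the given order.

    copy_prob maps frozenset({S, S0}) to the probability of copying between
    S and S0 in either direction; missing pairs are treated as independent.
    """
    weights = {}
    for i, s in enumerate(ordered_sources):
        w = 1.0
        for s0 in ordered_sources[:i]:  # the sources already considered
            w *= 1.0 - c * copy_prob.get(frozenset((s, s0)), 0.0)
        weights[s] = w
    return weights


def confidence(voters, accuracy_scores, weights):
    """Confidence of a value: accuracy scores scaled by independence weights."""
    return sum(accuracy_scores[s] * weights[s] for s in voters)
```

For example, if a second and a third source are each certain to have a copying relationship with the first one, each contributes only a fraction $1 - c$ of its accuracy score to the value they share.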

3.3 Iterative Algorithm 

We need to compute three measures: accuracy of sources, copying between sources,
and confidence of values. Accuracy of a source depends on confidence of values;
copying between sources depends on accuracy of sources and the true values
selected according to the confidence of values; and confidence of values depends
on both accuracy of and copying between data sources.

We conduct analysis of both accuracy and copying in each round. Specifically,
Algorithm AccuCopy starts by setting the same accuracy for each source and
the same probability for each value, then iteratively (1) computes copying based
on the confidence of values computed in the previous round, (2) updates
confidence of values accordingly, and (3) updates accuracy of sources accordingly,
and stops when the accuracy of the sources becomes stable. Note that it is
crucial to consider copying between sources from the beginning; otherwise, a data
source that has been duplicated many times can dominate the voting results in
the first round and make it hard to detect the copying between it and its copiers
(as they share only “true” values). Our initial decision on copying is similar to
Equation (12) except that it considers both the possibility of a value being true
and that of the value being false; we skip details here.

We can prove that if we ignore source accuracy (i.e., assuming all sources have
the same accuracy) and there are a finite number of objects in $\mathcal{O}$,
Algorithm AccuCopy cannot change the decision for an object $O$ back and forth
between two different values forever; thus, the algorithm converges.

Theorem 3. Let $\mathcal{S}$ be a set of good independent sources and copiers that
provide information on objects in $\mathcal{O}$. Let $l$ be the number of objects
in $\mathcal{O}$ and $n_0$ be the maximum number of values provided for an object
by $\mathcal{S}$. The AccuCopy algorithm converges in at most $2 l n_0$ rounds on
$\mathcal{S}$ and $\mathcal{O}$ if it ignores source accuracy.

Once we consider accuracy of sources, AccuCopy may not converge: when we 
select different values as the true values, the direction of the copying between 
two sources can change and in turn suggest different true values. We stop the 
process after we detect oscillation of decided true values. Finally, we note that
the complexity of each round is $O(|\mathcal{O}| |\mathcal{S}|^2 \log |\mathcal{S}|)$.


4 A Case Study 

We now describe a case study on a real-world data set (available at
http://lunadong.com/fusionDataSets.htm) extracted by searching computer-science
books on AbeBooks.com. For each book, AbeBooks.com returns information provided
by a set of online bookstores. Our goal is to find the list of authors for each
book. In the data set there are 877 bookstores, 1263 books, and 24364 listings
(each listing contains a list of authors on a book provided by a bookstore).

Table 2. Different types of errors by naive voting.

| Missing authors | Additional authors | Mis-ordering | Mis-spelling | Incomplete names |
|-----------------|--------------------|--------------|--------------|------------------|
| 23              | 4                  | 3            | 2            | 2                |

Table 3. Results on the book data set. For each method, we report the precision of
the results, the run time, and the number of rounds for convergence. AccuCopy and
Copy obtain a high precision.

| Model       | Precision | Rounds | Time (sec) |
|-------------|-----------|--------|------------|
| Vote        | .71       | 1      | .2         |
| Sim         | .74       | 1      | .2         |
| Accu        | .79       | 23     | 1.1        |
| Copy        | .83       | 3      | 28.3       |
| AccuCopy    | .87       | 22     | 185.8      |
| AccuCopySim | .89       | 18     | 197.5      |

We did a normalization of author names and generated a normalized form that 
preserves the order of the authors and the first name and last name (ignoring the 
middle name) of each author. On average, each book has 19 listings; the number 
of different author lists after cleaning varies from 1 to 23 and is 4 on average. 

We used a golden standard that contains 100 randomly selected books and the
list of authors found on the cover of each book. We compared the fusion results
with the golden standard, considering missing or additional authors,
mis-ordering, misspelling, and missing first name or last name as errors; however,
we do not report missing or misspelled middle names. Table 2 shows the number of
errors of different types on the selected books if we apply naive voting (note
that the resulting author lists on some books may contain multiple types of errors).
We define precision of the results as the fraction of objects on which we select the
true values (as the number of true values we return and the real number of true
values are both the same as the number of objects, the recall of the results is the
same as the precision). Note that this definition is different from that of accuracy
of sources.

Precision and Efficiency: We compared the following data fusion models on this
data set.

- Vote conducts naive voting;

- Sim conducts naive voting but considers similarity between values;

- Accu considers accuracy of sources as we described in Section 2, but
assumes all sources are independent;

- Copy considers copying between sources as we described in Section 3, but
assumes all sources have the same accuracy;

- AccuCopy applies the AccuCopy algorithm described in Section 3,
considering both source accuracy and copying;

- AccuCopySim applies the AccuCopy algorithm and considers in addition
similarity between values.

When applicable, we set $\alpha = .2$, $c = .8$, $\varepsilon = .2$ and $n = 100$.
We observed, though, that ranging $\alpha$ from .05 to .5, ranging $c$ from .5 to
.95, and ranging $\varepsilon$ from .05 to .3 did not change the results much. We
compared similarity of two author lists using 2-gram Jaccard distance.
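
One common reading of a 2-gram Jaccard measure is sketched below in Python. It assumes character-level bigrams over lowercased strings, which is an assumption of this sketch; the paper does not say whether its 2-grams are taken over characters or tokens.

```python
def bigrams(text: str) -> set:
    """Set of character 2-grams of the lowercased text."""
    s = text.lower()
    return {s[i:i + 2] for i in range(len(s) - 1)}


def jaccard_similarity(a: str, b: str) -> float:
    """Size of the 2-gram intersection divided by the size of the union."""
    ga, gb = bigrams(a), bigrams(b)
    if not ga and not gb:
        return 1.0  # two strings without any 2-gram are treated as identical
    return len(ga & gb) / len(ga | gb)


def jaccard_distance(a: str, b: str) -> float:
    """Jaccard distance is one minus the Jaccard similarity."""
    return 1.0 - jaccard_similarity(a, b)
```

Two author lists that differ only in a trailing character share most of their 2-grams and therefore end up at a small distance.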













Table 4. Bookstores that are likely to be copied by more than 10 other bookstores.
For each bookstore we show the number of books it lists and its accuracy computed
by AccuCopySim.

| Bookstore           | #Copiers | #Books | Accuracy |
|---------------------|----------|--------|----------|
| Caiman              | 17.5     | 1024   | .55      |
| MildredsBooks       | 14.5     | 123    | .88      |
| COBU GmbH & Co. KG  | 13.5     | 131    | .91      |
| THESAINTBOOKSTORE   | 13.5     | 321    | .84      |
| Limelight Bookshop  | 12       | 921    | .54      |
| Revaluation Books   | 12       | 1091   | .76      |
| Players Quest       | 11.5     | 212    | .82      |
| AshleyJohnson       | 11.5     | 77     | .79      |
| Powell’s Books      | 11       | 547    | .55      |
| AlphaCraze.com      | 10.5     | 157    | .85      |
| Avg                 | 12.8     | 460    | .75      |


Table 5. Difference between accuracy of sources computed by our algorithms and the
sampled accuracy on the golden standard. The accuracy computed by AccuCopySim is
the closest to the sampled accuracy.

|                         | Sampled | AccuCopySim | AccuCopy | Accu |
|-------------------------|---------|-------------|----------|------|
| Average source accuracy | .542    | .607        | .614     | .623 |
| Average difference      | -       | .082        | .087     | .096 |


Table 3 lists the precision of results of each algorithm. AccuCopySim obtained
the best results and improved over Vote by 25.4%. Sim, Accu and Copy each
extends Vote on a different aspect; while each of them increased the precision,
Copy increased it the most.

To further understand how considering copying and accuracy of sources can
affect our results, we looked at the books on which AccuCopy and Vote
generated different results and manually found the correct authors. There are 143
such books, among which AccuCopy gave correct authors for 119 books, Vote gave
correct authors for 15 books, and both gave incorrect authors for 9 books.

Finally, Copy was quite efficient and finished in 28.3 seconds. It took
AccuCopy and AccuCopySim longer to converge (3.1 and 3.3 minutes, respectively);
however, truth discovery is often a one-time process and so taking a few
minutes is reasonable.

Copying and source accuracy: Out of the 385,000 pairs of bookstores, 2916
pairs provide information on at least 10 books in common, and among them
AccuCopySim found 508 pairs that are likely to be dependent. For each such pair
$S_1$ and $S_2$, if the probability of $S_1$ depending on $S_2$ is over 2/3 of the
probability of $S_1$ and $S_2$ being dependent, we consider $S_1$ as a copier of
$S_2$; otherwise, we consider $S_1$ and $S_2$ to each have .5 probability of being
the copier. Table 4 shows the bookstores whose information is likely to be copied
by more than 10 bookstores. On average each of them provides information on 460
books and has accuracy .75. Note that among all bookstores, on average each
provides information on 28 books, conforming to the intuition that small
bookstores are more likely to copy data from large ones. Interestingly, when we
applied Vote on only the information provided by bookstores in Table 4, we
obtained a precision of only .58, showing that bookstores that are large and
copied often actually can make a lot of mistakes.
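
The 2/3 rule for deciding the copy direction can be written down as a small Python sketch. The function name `copier_shares`, the symmetric treatment of the two sources, and the returned pair of shares are assumptions of this sketch, not details given in the paper.

```python
def copier_shares(p_1_copies_2: float, p_2_copies_1: float, threshold: float = 2 / 3):
    """Return (share_1, share_2): how much each source counts as the copier."""
    total = p_1_copies_2 + p_2_copies_1  # probability of being dependent
    if total == 0:
        return (0.0, 0.0)
    if p_1_copies_2 > threshold * total:
        return (1.0, 0.0)  # S1 is taken to be the copier of S2
    if p_2_copies_1 > threshold * total:
        return (0.0, 1.0)  # S2 is taken to be the copier of S1
    return (0.5, 0.5)      # direction unclear: split evenly
```

Under this reading, summing the shares over all dependent pairs can produce non-integer copier counts for a bookstore.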

















Finally, we compare the source accuracy computed by our algorithms with that
sampled on the 100 books in the golden standard. Specifically, there were 46
bookstores that provide information on more than 10 books in the golden
standard. For each of them we computed the sampled accuracy as the fraction of
the books on which the bookstore provides the same author list as the golden
standard. Then, for each bookstore we computed the difference between its
accuracy computed by one of our algorithms and the sampled accuracy (Table 5). The
source accuracy computed by AccuCopySim is the closest to the sampled
accuracy, indicating the effectiveness of our model on computing source accuracy
and showing that considering copying between sources helps obtain better source
accuracy.


5 Related Work and Conclusions 

This paper presented how to improve truth discovery by analyzing accuracy of 
sources and detecting copying between sources. We describe Bayesian models 
that discover copiers by analyzing values shared between sources. A case study 
shows that the presented algorithms can significantly improve accuracy of truth 
discovery and are scalable when there are a large number of data sources. 

Our work is closely related to Data Provenance, which has been a topic of
research for a decade SHU. Whereas research on data provenance is focused on how
to represent and analyze available provenance information, our work on copy
detection helps detect provenance and in particular copying relationships between
dependent data sources.

Our work is also related to analysis of trust and authoritativeness of sources 111213110191121 
by link analysis or source behavior in a P2P network. Such trustworthiness is not 
directly related to source accuracy. 

Finally, various fusion models have been proposed in the literature. A comparison 
of them is presented in CD on two real-world Deep Web data sets, showing 
advantages of considering source accuracy together with copying in data fusion. 